Testing pathways to scale: study protocol for a three-arm randomized controlled trial of a centralized and a decentralized (“Train the Trainers”) dissemination of a mental health program for Kenyan adolescents

Background Providing care in Kenya to all youth in need is difficult because of a shortage of professional providers and societal stigma. Previous trials of the Anansi model, which involves delivering low-touch mental health interventions through a tiered caregiving model (including lay providers, supervisors, and clinical experts), have shown its effectiveness for reducing depression and anxiety symptoms in school-going Kenyan adolescents. In this trial, we aim to assess two different scale-up strategies by comparing centralized implementation (i.e., by the organization that designed the Anansi model) against implementation through an implementing partner. Methods In this three-arm trial, 1600 adolescents aged 13 to 20 years will be randomized to receive the Shamiri intervention from either the Shamiri Institute or an implementation partner or to be placed in the treatment-as-usual (TAU) control group. The implementation partner will be trained and supplied with protocols to ensure that the same procedures are followed by both implementers. Implementation activities will run concurrently for both implementers. The Shamiri intervention will be delivered by trained lay providers to groups of 10–15 adolescents over four weekly sessions, which will take place in secondary schools in Machakos and Makueni counties in Kenya. The TAU group will receive the usual care offered by their respective schools. Outcomes will be assessed at baseline, midpoint (2 weeks), endpoint (4 weeks), and 1-month follow-up. The analysis will be based on an intent-to-treat approach. Mixed effects models will be used to assess trajectories over time of the primary outcomes (anxiety and depressive symptoms, mental well-being, perceived social support, and academic performance) and secondary outcomes for the intervention groups and the control group. Effect sizes will be computed for the mean differences of the intervention and control arms at midpoint, endpoint, and follow-up. Discussion This trial will provide insight into the comparative effectiveness of different strategies for scaling a school-based mental health care model. Findings will also indicate areas for improved efficiency of the model to enhance its replicability by other implementers. Trial registration Pan African Clinical Trials Registry (PACTR) (ID: PACTR202305589854478, Approved: 02/05/2023). Supplementary Information The online version contains supplementary material available at 10.1186/s13063-023-07539-y.


Introduction
Here we document the power analysis and sample size determination (on level 2, the level of participants) for the Anansi study. Overall this is a difficult undertaking, as we need to consider:
- A complicated nesting structure, which we must tackle with multilevel models.
- Expected attrition that is only partially matched by the attrition already present in the pilot data.
- Two outcome measures (PHQ and GAD) that do not behave the same, although we need a single sample.
- Different effects and attrition at different time points, although we need one longitudinal sample.
We do the sample size calculations under a worst-plausible-case scenario. We want to make sure we achieve at least a target power of P% under our assumptions. Therefore we aim for the minimum sample size needed under a non-optimal but realistic scenario (attrition, effect sizes, etc.); this estimate will therefore be conservative. In practice, if the situation turns out better than assumed (say, attrition is lower, effect sizes are larger, or noise is lower), a smaller sample size might work just as well, but there is no guarantee this better situation will occur. So in a way what we do here is risk management: we hedge against realistic but suboptimal outcomes by making sure the sample size is large enough. If things turn out even worse than we assume, our estimate would of course be too optimistic.
Notation-wise, I'll distinguish between the effective sample size (n_eff), which is the sample size returned by the simulations, and the necessary minimal sample size to achieve that effective sample size given the attrition assumptions, n* = a × n_eff, with a an attrition correction factor described below. The power numbers given always correspond to an effective sample size (or the corresponding n*). We need to use n* to know how many participants we need at least.

Summary
In Table 1 I list the necessary minimal sample sizes n*, the effective sample sizes n_eff, and the standardized (d) and unstandardized (b) effect sizes for which we expect a power of at least 80% over all time points and both dependent variables under the assumptions/setup discussed below. I also list the dependent variable associated with each n*; we always only need to take the smallest standardized effect over both DVs into account to get the minimal necessary sample size for a given power, because for the DV with the larger effect the power will be higher at that sample size anyway. We also list the 95% confidence interval (CI) of the estimated power in that scenario and the time point at which it occurred (in case we decide not to power for all time points but only for, say, EP).

Table 1: Minimal sample size n*, effective sample size n_eff, unstandardized effect size b, and standardized effect size d for which a power of at least 0.8 is achieved at any time point (over both dependent variables). The dependent variable that gave rise to that sample size is given in the column DV and the time point at which this happened is given in the column Time.
The table is to be read like this: use n* = 1554 to make sure that we have at least 80% power for all the assumed effects over all three time points for both DVs (the lowest d overall). This is the information we're after and my recommendation.
But we may want to be more flexible and decide we no longer care about an effect size at MP; this table then lists other n* that may be of interest, at which point we'd have around 80% power for other effect sizes. For example, say we decide to only care about EP; then for this assumed effect size an n* = 666 will suffice for both DVs (the power at the other time points with that sample size is reported below). If we care about a d = 0.25 over any time point and any DV, we would use n* = 1443; if we decided to only care about GAD (not PHQ), then n* = 1221 would suffice for a d = 0.25. If we cared about an assumed unstandardized effect of b = −1 over any time point for PHQ only, we'd use n* = 1443 (the one for FU), but if we only cared about −1 at EP, n* = 1110 would suffice.

Assumptions and Status Quo
Here we document the assumptions that we make for the sample size determination and power analysis as well as discuss the status quo.

Status Quo
The planned design is that we have students (participants or kids, Participant_ID) from different counties and schools randomly assigned to a control (TAU) or a treatment condition (factor Condition). The treatment condition gets the Shamiri intervention. We measure students' symptoms with the PHQ-8 and the GAD-7 over the course of 8 weeks (factor Time): baseline (BL, 0 weeks in), midpoint (MP, 2 weeks in), end point of the intervention (EP, 4 weeks in), and 1-month follow-up (FU, 8 weeks in). We're specifically interested in the power of tests of significance between the treatment condition and the control condition (called TAU) at all time points other than BL for different sample sizes, and in the minimum sample size that is estimated to give us a specific power.
There will be two implementers: Shamiri Institute (SHA) and AMHRTF (AMH). In this analysis we'll check power/sample size of TAU vs. treatment and assume that the effects of SHA and AMHRTF are the same. This gives one combined power analysis/sample size determination.
We have a pre-study that we use as our blueprint: the JAMA Psychiatry study with data from 2019. The nesting structure is assumed to be the same, so there are repeated observations per student that are nested in students. Students are nested in administration groups (in which all students got either the intervention or the control; factor Group), which are in turn nested within their group leaders (Group_Leader). The groups are also nested within schools (no administrative group consists of students from different schools; factor School). Group_Leader is crossed with School (as each leader could have groups in more than one school).
We have the control variables Age, Gender, and Form. As dependent variables (DV) we'll use the sum scores of the PHQ and the GAD, respectively.
Based on that, we fit the following model (in lmer pseudo-code):

> DV ~ Age + Gender + Form + Time * Condition
+   + (1 | Participant_ID:Group:Group_Leader) + (1 | Group:Group_Leader)
+   + (1 | Group_Leader) + (1 | School)

Since we're interested in the power of tests of significance between the control condition (called TAU) and the treatment condition (SH), we are interested in mean(ConditionSH) − mean(ConditionTAU) at MP, EP, and FU. This is easiest to determine by reparametrizing the model so that EP, MP, and FU are the reference levels of Time in turn and just looking at the t-test of ConditionSH = 0 (assuming the reference for Condition is TAU), because then mean(ConditionSH) = Intercept + ConditionSH and mean(ConditionTAU) = Intercept. Otherwise it would mean looking at ConditionSH = 0 at the reference level (say EP) and at ConditionSH + ConditionSH:TimeFU = 0 at FU, since mean(ConditionTAU, FU) = Intercept + TimeFU and mean(ConditionSH, FU) = Intercept + ConditionSH + TimeFU + ConditionSH:TimeFU, and we'd need a generalized linear hypothesis test for that effect. For MP it is equivalent; just swap TimeFU for TimeMP.
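As an illustration, the releveling approach could look as follows. This is a minimal sketch: the data frame dat, the factor name Timef, and the level names are assumptions based on the description above; lmerTest supplies the Satterthwaite t-tests used later.

> library(lme4)
> library(lmerTest)  # Satterthwaite t-tests for the fixed effects
> # Make EP the reference level of the time factor, so the ConditionSH
> # coefficient is the treatment-control difference at EP
> dat$Timef <- relevel(dat$Timef, ref = "EP")
> m.ep <- lmer(DV ~ Age + Gender + Form + Timef * Condition
+   + (1 | Participant_ID:Group:Group_Leader) + (1 | Group:Group_Leader)
+   + (1 | Group_Leader) + (1 | School), data = dat, REML = FALSE)
> summary(m.ep)$coefficients["ConditionSH", ]  # t-test of ConditionSH = 0 at EP
> # Repeat with ref = "MP" and ref = "FU" for the other time points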
Since we must do this with multilevel models, there are no closed-form solutions and we have to simulate. I'll use the R package simr for this.
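Schematically, a single power estimate then looks like this. This is a sketch: phq4simr is the name used further below for the fitted 2019 PHQ model, nsim is the number of Monte Carlo replications, and the coefficient name follows the 2019 condition labels.

> library(simr)
> nsim <- 1000  # number of simulated data sets per power estimate
> # Simulate from the fitted model, refit, and count significant t-tests
> ps <- powerSim(phq4simr, test = fixed("ConditionStudy-skills", method = "t"),
+   nsim = nsim, alpha = 0.045)
> ps  # prints the estimated power with a 95% confidence interval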

Assumptions and Parameters
Power/sample size determination for these models entails making many assumptions about hypothesized effect sizes and other parameters. I'll use values we obtained in the previous studies, without imputation. Note that we don't have a 1-month follow-up in the 2019 study, so I equate the 2-week follow-up in the 2019 JAMA study with the 1-month follow-up (i.e., FU in Anansi corresponds to 2WFU in 2019).
Note that because we have to test fixed effects, we must not use the REML criterion for estimation in the mixed-effects models, so the effects (mainly the random effects) will differ a bit between what I have here and what is in the ShamiriDASupplement (where we used REML). That shouldn't make a big difference for the sample size/power for testing fixed effects.
For PHQ (based on FML, not REML): Fixed effects: We expect the average value of the DV at EP in the treatment group to be 6.4 for the reference kids (Gender = male, Form = 1, Age = min(Age)) (Intercept). For the reference kids in the control group at EP we expect a mean of 7.64, so the average score is higher by 1.24 in control: mean(treatment) − mean(control) = b = −1.24. At 1-month follow-up (FU) we expect a shift down for the treatment group of −1.03, to a mean of 5.37. For the control group at FU we expect a decrease from EP to FU of −1.13 (−1.03 − 0.1), to a mean of 6.51. This makes the difference between the treatment and control group at FU b = −1.13 (treatment − control). We thus calculate power/sample size with a hypothesized effect size between control and treatment of b = −1.24 at EP and of b = −1.13 at FU. At MP the difference between the means of treatment and control is b = 0.89 (so control did better than treatment). These unstandardized effect sizes roughly translate to d = 0.29 at EP, d = 0.27 at FU, and d = 0.22 at MP, respectively, when standardizing them by dividing by the residual standard error. All other fixed effects are assumed to be as in the JAMA study.
Random effects: We expect the residual standard error to be 4.09. We expect the standard deviations of the random intercepts to be 2.32 for Participant_ID, 0.28 for Group_Leader, and 0.45 for School. For Group it is 0 (or ignored).
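One way to encode these assumptions in a simulation model is simr's makeLmer(), sketched below. The fixed-effects vector b.assumed, the design frame design.dat, and the factor name Timef are hypothetical placeholders; the Group term is dropped since its SD is assumed to be 0.

> library(simr)
> # Variances are the squared SDs assumed above; the order of the list
> # must match the order of the random-effects terms in the model
> vc <- list(2.32^2, 0.28^2, 0.45^2)  # Participant_ID, Group_Leader, School
> phq.sim <- makeLmer(PHQ ~ Age + Gender + Form + Timef * Condition
+   + (1 | Participant_ID) + (1 | Group_Leader) + (1 | School),
+   fixef = b.assumed, VarCorr = vc, sigma = 4.09, data = design.dat)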
Attrition: We expect 30% attrition at midpoint and at follow-up (each relative to baseline, not cumulative) and 40% at endpoint (see the correction-factor calculation below), so this needs a corrective factor. This is extremely high attrition.
For GAD (based on FML, not REML): Fixed effects: We expect the average value of the DV at EP in the treatment group to be 6.6 for the reference kids (Gender = male, Form = 1, Age = min(Age)). For the reference kids in the control group at EP we expect the average score to be higher by 1.67, so b = −1.67 at EP; at FU we expect b = −1.05. These unstandardized effect sizes roughly translate to d = 0.44 and d = 0.28, respectively, when standardizing them by dividing by the residual standard error. All other fixed effects are assumed to be exactly as in the JAMA study. At MP the difference between the means of treatment and control is so close to zero as to not be practically relevant (b = −0.5, d ≈ 0.13) and is therefore ignored subsequently.
Random effects: We expect the residual standard error to be 3.8. We expect the standard deviations of the random intercepts to be 2.1 for Participant_ID, 0.4 for Group, and 0.0005 for School. For Group_Leader it is 0.3.
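The GAD analogue of the PHQ sketch above would then be (same hypothetical placeholders; here all four random intercepts are non-zero):

> vc.gad <- list(2.1^2, 0.4^2, 0.3^2, 0.0005^2)  # Participant_ID, Group, Group_Leader, School
> gad.sim <- makeLmer(GAD ~ Age + Gender + Form + Timef * Condition
+   + (1 | Participant_ID) + (1 | Group) + (1 | Group_Leader) + (1 | School),
+   fixef = b.gad.assumed, VarCorr = vc.gad, sigma = 3.8, data = design.dat)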
Attrition: As for PHQ, we expect 30% attrition at midpoint and at follow-up (each relative to baseline, not cumulative) and 40% at endpoint, so this needs a corrective factor. This is extremely high attrition.
Note that from EP to FU we assume further improvement in symptoms, as this is the pattern in the previous data. For testing the effect between control and treatment at each time point, it doesn't really matter what the pattern (down or up) from EP to FU is, as long as the pattern is the same in both groups. We use the data from the 2019 JAMA study as the blueprint. Overall we had 413 participants in that study. We didn't have a 1-month follow-up (FU) as is planned in Anansi, but a 2-week follow-up (2WFU). I hypothesize that what was observed at 2WFU in 2019 is also what we'll observe at FU in Anansi.
I've been instructed that we want to determine the necessary minimum sample size for all differences between the control and treatment condition at all of the time points MP, EP, and FU. Note we don't want this for BL because we randomize at BL; we don't want significant differences in the outcome before we start the treatment, because that would mean the randomization hadn't worked.
Having tests at three time points leads to the following observation: let us for the moment assume there is no attrition. Then we will observe the same participants at each time point, as this is a longitudinal study. This means the sample size for testing condition differences cannot differ between time points (the participants, under intent to treat, need to be recruited beforehand and stay the same). So we cannot have different sample sizes at different time points (again, assuming no attrition). This is important because it implies we only need to find the sample size necessary to detect the smallest difference between treatment and control observed over any of the time points we're interested in: MP, EP, and FU.
Since both DVs will be measured on the same students, we need to select the maximum required number of students per time point over both DVs to get the overall sample size n*.
Note that since this is a simulation, the concrete values can vary a bit between runs. I'll set a seed for reproducibility. If we need to make sure the estimated power is conservative, we can use the lower boundary of the CI as the associated power (and perhaps a higher sample size, so that the lower bound of the CI exceeds the target power of 80%).
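For instance (the seed value is arbitrary; summary() on a simr power object reports the 95% CI, whose lower bound can serve as the conservative estimate):

> set.seed(20230502)  # arbitrary seed, fixed for reproducibility
> ps <- powerSim(phq4simr, test = fixed("ConditionStudy-skills", method = "t"),
+   nsim = nsim, alpha = 0.045)
> summary(ps)  # estimated power plus lower/upper 95% CI bounds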
Due to how the software we're using works, we have to remove missing data prior to running the simulations. This means the effects estimated on the data without missings differ slightly, because there are some participants for whom we lose all observations; in the 2019 model they didn't need to be removed and therefore had an impact through the partial pooling of the mixed models. To accommodate this, I'll take the object we get from removing the missings and add to it the fixed-effects and random-effects statistics from the 2019 study; this makes the whole thing doable and luckily doesn't affect the estimates in terms of the standardized effect sizes at MP.
Removing the missings prior to the simulations means that we automatically carry the attrition at each time point with us (the software can't deal with NAs). Internally, the simulations randomly replicate participants (upscaling) to achieve the target sample size, which means each participant's missingness pattern is upscaled as well. The effect is this: the original data contain 405 participants (after removing participants with only missing observations), so we have up to 405 observations per time point. Since we have 5 time points in the data, that means up to 2025 level-1 observations (measurements per person). (In the JAMA data we actually have 408 kids, and thus potentially 2040 observations, but 3 kids drop out completely after removal of missings, so we work without these 3.) Now, say we want to upscale to 2000 people instead of 405: of the 405 people we replicate as many as we need to reach 2000 and add them to the data set. But not all people had full records at all time points, so we do not get 2000 * 5 = 10000 level-1 observations, because whichever person gets replicated, their missing values get replicated too. So if the average percentage of missing values over all people in the data were 15%, then upscaling to 2000 students effectively means upscaling to 10000 − (10000 * 0.15) = 8500 level-1 observations instead of 10000. This is how the attrition/missings in the JAMA study carry over into the simulation. Unfortunately, we cannot fully control in the software how much attrition is carried over at each time point so as to match exactly the observed attrition at that time point; but due to the random replication of people it should vary only slightly around the average, which is about 30% overall (1431 level-1 observations as compared to 2025 if we had full records for everyone). Therefore, with respect to sample size determination, the number of participants obtained via the simulation only needs to be corrected by the excess attrition that we assume on top of what is carried through the simulation.
We assume an attrition from BL to EP of 40% and of 30% from BL to FU. The attrition carried over into the simulation is around 30% for both PHQ and GAD. For both PHQ and GAD, this means an excess attrition of 0 percentage points from BL to MP, 10 percentage points from BL to EP, and 0 percentage points from BL to FU on top of the attrition already included in the 2019 model that is carried through the simulations (about 30% on average). Thus we have to scale up the effective minimal sample size n_eff obtained in the simulation by a factor of a = 1/(1 − 0.10) ≈ 1.11 to match the highest expected attrition of 40% at EP, ensuring that the power obtained for n_eff is met by using n* = a × n_eff. For FU, a = 1.
Overall, therefore, since we use the same sample of participants for GAD and PHQ, we just need to multiply whatever minimal sample size we get in the simulations under the 2019 attrition by a = 1.11 (as 1.11 × n_eff > 1 × n_eff) to get a sample size that yields at least n_eff, and therefore the target power, under the assumed attrition (same argument as before with the effect sizes). Note that this is again conservative: if attrition happens to be lower, that sample size will still suffice (we will simply have more power).
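The correction is then a one-liner (using the rounded factor a = 1.11, which reproduces the n* values reported in the Summary; the n_eff value is an example):

> # Correct the simulated effective sample size for excess attrition at EP
> a <- round(1 / (1 - 0.10), 2)  # = 1.11
> n_eff <- 1400                  # example: smallest effective sample size from the simulations
> n_star <- a * n_eff            # = 1554 participants to recruit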
For the tests, I'll use the t-statistic with the Satterthwaite denominator-degrees-of-freedom (ddf) calculation (it is not yet clear in the literature which ddf should be used in general). At this number of observations per participant, the asymptotics should be fine and we shouldn't get much divergence between different ddf calculations (I usually prefer Kenward-Roger, but we'd have to set up the model differently so as to isolate the effect instead of using Timef * Condition, and this is much easier with Satterthwaite). As mentioned before, I use α = 0.045 instead of 0.05, as Baayen et al. (2008) suggest.

PHQ
For PHQ, we observed significant differences between Wellness (the Shamiri treatment condition) and Study-Skills (the control condition) at EP. We did not find a significant effect at MP. We did not find a significant effect at 2WFU (but only barely not). These results are reproduced subsequently with REML=FALSE; the fitted model was the one above (now with REML=FALSE). The smallest effect is b = 0.89 at MP (see above), roughly corresponding to a standardized effect size of d = 0.22 (dividing the effect by the residual standard error from the model above). There was also interest in a standardized effect size of d = 0.3, so I'll run the simulation with d = 0.3 as well; this is the effect size that corresponds to the effect at EP (since that one is larger, the sample size derived from it will give us less power for the test at MP). Note that at MP the control did better than the treatment (so a lower symptom mean for Study-Skills, and therefore the difference TimefMP:ConditionWellness − TimefMP:ConditionStudy-skills is positive). It is also a good idea to run both effect sizes, as d = 0.22 is about 3/4 of d = 0.3, so we hedge against a less optimistic outcome. In a nutshell: the sample size determined with b = 0.89 is the minimum sample size we need to detect any of the effects observed at MP, EP, or FU in 2019.
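A sketch of how this could look in simr (phq4simr again names the REML=FALSE fit used as the simulation basis; setting the coefficient via fixef<- encodes the hypothesized effect, with the sign following the coding discussed above):

> # Fix the hypothesized smallest effect (b = 0.89 at MP) and estimate power
> fixef(phq4simr)["ConditionStudy-skills"] <- 0.89
> phq.ps.mp <- powerSim(phq4simr, test = fixed("ConditionStudy-skills", method = "t"),
+   nsim = nsim, alpha = 0.045)
> phq.ps.mp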

Other Target Effect Sizes
For completeness, we'll also do the calculations assuming the minimally relevant effect size anywhere to be the one at EP (roughly d = 0.3), and also with a hypothesized effect difference of b = 1 (d = 0.25).
Hypothesized effect of d ≥ 0.3 everywhere. First, for a hypothesized effect that is always at least as large as the one at EP, so b = −1.24.
For this effective sample size, we'd have the following estimated power for a hypothesized effect size of b = 1 at MP and FU.
Personally, I think this is a hypothesized effect size that is practically relevant without being overly optimistic (a small to medium effect of d = 0.25). For it we have reasonable power at all time points if we use an effective sample size of n_eff = 1000 (n* = 1110 participants with attrition correction). As we can see, for n* = 1110 the power hovers between 0.73 and 0.83 over all time points, which is uniformly relatively high (compare that to the significant 2019 effect at EP, which had only 62% power to be detected). This would be a good compromise between sample size, hypothesized effect size, and detection power. If we can afford more participants, I'd aim at getting at least 80% power also at FU for an effect of 1; for this we'd need an effective sample size of about n_eff = 1300 (n* = 1443), see:

> # When would we cross 80% power for an effect of 1 at FU?
> phqmodelext.1500 <- extend(phq4simr, along="Participant_ID", n=1500)  # we extend along the grouping
> # variable Participant_ID, so we now have 1500 groups = participants
> phq.pc2one <- powerCurve(phqmodelext.1500, test=fixed("ConditionStudy-skills", method="t"),
+   along="Participant_ID", nsim=nsim, breaks=c(seq(1000,1500,by=100)),
+   alpha=0.045)
> save(phq.ps2one, phq.ps3one, phq.ps4one, phq.pc1one, phq.pc2one, file="PHQPowerAtOne.rda")
> phq.pc2one
Power for predictor 'ConditionStudy-skills', (95% confidence interval), by number of levels

GAD
For GAD, we observed significant differences between Wellness (the Shamiri treatment condition) and Study-Skills (the control condition) at EP. We did not find a significant effect at 2WFU (but only barely not). At MP the effect size is so low that it is of no practical consequence (d < 0.2); I'll therefore ignore MP when doing the calculations for GAD. These results are reproduced subsequently with REML=FALSE; the fitted model was the one above (now with REML=FALSE). The smallest effect is b = −0.5 at MP (see above), roughly corresponding to a standardized effect size of d = 0.13 (dividing the effect by the residual standard error from the model above). This effect is too small to be of practical relevance and would need a very large n_eff, so I'll ignore MP and its effect size even though we were said to also be interested in MP (there really is no point in an effect of d = 0.13). I'll therefore run the simulation with FU as the minimum effect size, which was b = −1.05 (corresponding to d = 0.28). I'll also check power for the effect size that corresponds to the effect at EP, but since that one is larger, we will have more power anyway for the effective sample size obtained for FU (or would need a smaller n for the same power).
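Analogously to the PHQ sketch above (gad4simr is an assumed name for the corresponding REML=FALSE GAD fit):

> # Fix the hypothesized FU effect (b = -1.05) and estimate power for GAD
> fixef(gad4simr)["ConditionStudy-skills"] <- -1.05
> gad.ps.fu <- powerSim(gad4simr, test = fixed("ConditionStudy-skills", method = "t"),
+   nsim = nsim, alpha = 0.045)
> gad.ps.fu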

Other Target Effect Sizes
For completeness, we'll also do the calculations assuming the minimally relevant effect size to be a hypothesized effect difference of b = −0.95 (d = 0.25).
Hypothesized effect of d ≥ 0.25 everywhere. Now for a hypothesized effect of b = −0.95 (i.e., a score difference of at least 0.95 in magnitude) at every time point.